Brief Description
We have been provided with Kroger data for academic purposes, to analyse the datasets and come up with meaningful insights by applying the data wrangling and machine learning techniques we have learnt during our Business Analytics program. The datasets provided are not complete; we will be focusing on the product, household and transaction datasets. The programming language used to analyse the data will be R.
Problem Statement
To analyse the Transaction, Product and Household datasets of Kroger to answer the following questions:
Approach
Before we begin the analysis, it is important to look at the data and get a feel for it. So I will start by importing the required datasets into R and examining their structure. Looking at the data helps reveal what is wrong with it and which methods need to be applied to clean it. Cleaning the data is an essential part of producing meaningful insights and can consume up to 60% of the time and effort. Beginning the analysis without cleaning the data can have negative consequences, leading to undesired results and conclusions that are not meaningful. Once the data is cleaned, we will move on to exploratory data analysis and apply different machine learning techniques to generate meaningful conclusions.
The analysis will try to answer the questions posed in the problem statement so that the stakeholders who use this analysis can target consumers in the right way to increase the company's sales.
Packages Required
Following are the packages required to analyse the given datasets:
readr - Provides a fast and friendly way to read rectangular (csv, tsv) data
dplyr - Provides verbs that help in data manipulation of objects like data frames
lubridate - A package to work with dates and times
stringr - Used for character manipulation, pattern matching in strings, etc.
DT - Helps in displaying data objects as tables on HTML pages
tidyverse - A set of packages designed to be loaded together, so multiple packages are attached simultaneously
caret - Contains functions to streamline the model training process for regression and classification problems
plotly - Used for creating interactive, web-based graphics in R
ranger - Used for a fast implementation of random forests
zoo - Used for performing calculations on irregular time series
arules - Used for mining frequent itemsets and association rules
arulesViz - Visualisation techniques for association rules and itemsets
knitr - Used for dynamic report generation
plyr - Contains tools for splitting, applying and combining data
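Assuming all of these packages are installed, a setup chunk along the following lines (not shown in the original output) would load them:
# Load the packages used throughout the analysis
library(tidyverse)   # attaches readr, dplyr, purrr, stringr, ggplot2, tidyr
library(lubridate)
library(DT)
library(caret)
library(plotly)
library(ranger)
library(zoo)
library(arules)
library(arulesViz)
library(knitr)
library(plyr)        # masks some dplyr verbs (e.g. summarise), hence the explicit dplyr:: calls below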
Source
The data represents customer transactions over two years from a group of 2,500 households who are frequent shoppers at Kroger. It does not capture the effects of marketing to customers, as the focus will be only on partial datasets, leaving out the datasets that contain the "campaign" data. This data will be used only for academic purposes.
Data Description
Household dataset - It contains demographic information about the selected households
Variables and their description
HSHD_NUM - Uniquely identifies each household
LOYALTY_FLAG - Indicates the loyalty of the customer with the company
AGE_RANGE - Estimated age range
MARITAL_STATUS - Marital status
INCOME_RANGE - Household income
HOMEOWNER_DESC - Homeowner, renter
HSHD_COMPOSITION - Household composition
HH_SIZE - Size of household, up to 5+
CHILDREN - Number of children present, up to 3+
Product dataset - It contains information on each product
Variables and their description
PRODUCT_NUM - Uniquely identifies each product
DEPARTMENT - Groups similar products together at the department level
COMMODITY - Groups similar products together at the lowest level
BRAND_TYPE - Indicates Private or National brand
ORGANIC_FLAG - Indicates whether the product is organic or not
Transaction dataset - It contains information on each transaction
Variables and their description
BASKET_NUM - Uniquely identifies a purchase occasion
HSHD_NUM - Uniquely identifies each household
PURCHASE_DATE - Date when the transaction occurred
PRODUCT_NUM - Uniquely identifies each product
SPEND - Amount of dollars the retailer received from the sale
UNITS - Number of items sold
STORE_REGION - Region the store belongs to
WEEK_NUM - Week of transaction
YEAR - Year of sale
Data Importing and Cleaning
Importing the datasets using the read_csv function from the readr package. It came to our notice that some of the columns were not named as per the User Guide that was shared with us, so we rectify those column names using the col_names argument. The pmap function is used to import all three datasets in one pass.
household_colnames <- c("HSHD_NUM","LOYALTY_FLAG","AGE_RANGE","MARITAL_STATUS",
"INCOME_RANGE","HOMEOWNER_DESC","HSHD_COMPOSITION",
"HH_SIZE","CHILDREN")
products_colnames <- c("PRODUCT_NUM","DEPARTMENT","COMMODITY","BRAND_TYPE",
"ORGANIC_FLAG")
transactions_colnames <- c("BASKET_NUM","HSHD_NUM","PURCHASE_DATE","PRODUCT_NUM",
"SPEND","UNITS","STORE_REGION",
"WEEK_NUM","YEAR")
files <- list("data/5000_households.csv", "data/5000_products.csv",
"data/5000_transactions.csv")
column_list <- list(household_colnames, products_colnames, transactions_colnames)
column_type <- list("ccccccccc", "ccccc", "ccccdicii")
skip_rows <- list(1, 1, 1)
df_combined <- pmap(list(file = files, col_names = column_list, col_types = column_type,
skip = skip_rows), read_csv)
households <- data.frame(df_combined[[1]])
products <- data.frame(df_combined[[2]])
transactions <- data.frame(df_combined[[3]])
Understanding the structure of the datasets is very important before we begin cleaning them. We can investigate the datasets with various built-in functions provided in R, so we get a feel for the data using the following functions.
dim
names
head
tail
glimpse
summary
is.na
Looking at the structure of the household, product and transaction datasets.
households[households == ""] <- NA
products[products == ""] <- NA
transactions[transactions == ""] <- NA
map(df_combined,dim)
map(df_combined,names)
map(df_combined,head)
map(df_combined,tail)
map(df_combined,glimpse)
map(df_combined,~any(is.na(.)))
map(df_combined,~colSums(is.na(.)))
map(df_combined,summary)
After importing the datasets, we can see that many of the columns contain a lot of missing values represented by "null" or "Unknown". But since R recognises missing values only as NA, we replace these placeholders with NA.
for (i in 1:ncol(households)) {
households[,i][households[,i ] == "null" | households[,i ] == "Unknown" |
households[,i ] == "NOT AVAILABLE" ] <- NA
}
sum(is.na(households))
sum(complete.cases(households))
There were variables in the datasets that were categorical in nature but were represented by the character datatype, so it was necessary to convert them to the FACTOR datatype. Using type conversion, we convert these variables into the appropriate format.
Converting the household variables into FACTOR format.
for (i in c(2,4,6,7)) {
households[,i] <- as.factor(households[,i])
}
unique(households$INCOME_RANGE)
households$INCOME_RANGE <- factor(households$INCOME_RANGE,
order = TRUE, levels =
c("UNDER 35K","35-49K","50-74K","75-99K",
"100-150K","150K+"))
households$AGE_RANGE <- str_trim(households$AGE_RANGE)
unique(households$AGE_RANGE)
households$AGE_RANGE <- factor(households$AGE_RANGE,
order = TRUE, levels =
c("19-24","25-34","35-44","45-54",
"55-64","65-74","75+"))
households$CHILDREN <- factor(households$CHILDREN,
order = TRUE, levels =
c("1","2","3+"))
households$HH_SIZE <- factor(households$HH_SIZE,
order = TRUE, levels =
c("1","2","3","4","5+"))
Converting the product variables into FACTOR format.
for (i in 2:5) {
products[,i] <- as.factor(products[,i])
}
Converting the transaction variables into FACTOR and date formats.
transactions$STORE_REGION <- as.factor(transactions$STORE_REGION)
transactions$PURCHASE_DATE <- dmy(transactions$PURCHASE_DATE)
transactions$BASKET_NUM <- as.numeric(transactions$BASKET_NUM)
Looking at the structure of the data after cleaning.
glimpse(households)
## Observations: 5,000
## Variables: 9
## $ HSHD_NUM <chr> "0688", "2590", "1171", "1531", "0403", "0283...
## $ LOYALTY_FLAG <fct> Y, N, Y, Y, N, Y, Y, Y, N, Y, Y, Y, Y, Y, Y, ...
## $ AGE_RANGE <ord> 75+, 75+, 75+, 75+, 75+, 25-34, 25-34, 25-34,...
## $ MARITAL_STATUS <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ INCOME_RANGE <ord> 35-49K, 50-74K, 75-99K, 75-99K, UNDER 35K, 50...
## $ HOMEOWNER_DESC <fct> Homeowner, Homeowner, Homeowner, Homeowner, R...
## $ HSHD_COMPOSITION <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ HH_SIZE <ord> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ CHILDREN <ord> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
glimpse(products)
## Observations: 151,141
## Variables: 5
## $ PRODUCT_NUM <chr> "0092993", "0093924", "0094272", "0094299", "0094...
## $ DEPARTMENT <fct> NON-FOOD, NON-FOOD, NON-FOOD, NON-FOOD, NON-FOOD,...
## $ COMMODITY <fct> PET, PET, PET, PET, PET, PET, PET, PET, PET, PET,...
## $ BRAND_TYPE <fct> PRIVATE, PRIVATE, PRIVATE, PRIVATE, PRIVATE, PRIV...
## $ ORGANIC_FLAG <fct> N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N...
glimpse(transactions)
## Observations: 10,625,553
## Variables: 9
## $ BASKET_NUM <dbl> 24, 24, 34, 60, 60, 168, 199, 252, 355, 366, 379...
## $ HSHD_NUM <chr> "1809", "1809", "1253", "1595", "1595", "3393", ...
## $ PURCHASE_DATE <date> 2016-01-03, 2016-01-03, 2016-01-03, 2016-01-03,...
## $ PRODUCT_NUM <chr> "5817389", "5829886", "0539501", "5260099", "453...
## $ SPEND <dbl> -1.50, -1.50, 2.19, 0.99, 2.50, 4.50, 3.49, 2.79...
## $ UNITS <int> -1, -1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1...
## $ STORE_REGION <fct> SOUTH, SOUTH, EAST, WEST, WEST, SOUTH, SOUTH, SO...
## $ WEEK_NUM <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ YEAR <int> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, ...
Summary
Kroger’s products are approximately equally divided between food and non-food categories, with the food category containing 74,958 products and the non-food category containing 73,134 products. SPEND ranges from -14 to 300. There is a large number of negative spend values, so rather than removing them we assume that goods were returned by the customer and a reverse entry was recorded. UNITS is a similar case: it ranges from -26 to 246, so we assume customers have returned those units. Similarly, the maximum week number is 104, which as per the document provided should not be possible since the maximum week mentioned there is 102; but again we keep these rows as they are, since a large number of rows contain week number 104. Finally, there were a lot of missing values in certain columns; instead of removing them, I decided to keep them as they are, because dropping them without understanding the business context could bias the final conclusions.
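As a quick sanity check on these anomalies, counts like the following can be computed (a sketch; the output is not reproduced here):
# Quick checks on the anomalies noted above
sum(transactions$SPEND < 0)        # transactions recorded with negative spend (assumed returns)
range(transactions$UNITS)          # includes negative units for returned items
sum(transactions$WEEK_NUM > 102)   # rows with week numbers beyond the documented 102 weeks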
summary(households, maxsum = 15)
summary(products, maxsum = 15)
summary(transactions)
Displaying the top 50 observations of the household dataset.
households_top50 <- head(households, n = 50)
datatable(
households_top50, rownames = FALSE,
extensions = 'Buttons', options = list(
dom = 'Bfrtip',
buttons = c('copy', 'csv', 'excel', 'pdf', 'print')
)
)
Displaying the top 50 observations of the product dataset.
products_top50 <- head(products, n = 50)
datatable(
products_top50, rownames = FALSE,
extensions = 'Buttons', options = list(
dom = 'Bfrtip',
buttons = c('copy', 'csv', 'excel', 'pdf', 'print')
)
)
Displaying the top 50 observations of the transaction dataset.
transactions_top50 <- head(transactions, n = 50)
datatable(
transactions_top50, rownames = FALSE,
extensions = 'Buttons', options = list(
dom = 'Bfrtip',
buttons = c('copy', 'csv', 'excel', 'pdf', 'print')
)
)
Combining all three datasets with the help of inner joins and displaying the top 50 rows of the combined dataset.
merged_data <- transactions %>%
inner_join(products, by = "PRODUCT_NUM") %>%
inner_join(households, by = "HSHD_NUM")
merged_data_top50 <- head(merged_data, n = 50)
datatable(
merged_data_top50, rownames = FALSE,
extensions = 'Buttons', options = list(
dom = 'Bfrtip',
buttons = c('copy', 'csv', 'excel', 'pdf', 'print')
)
)
Exploratory Data Analysis
The graph below (Spending vs Units) shows the spending patterns of households falling under different income brackets. The data is aggregated over both years for each household. We can clearly see that households in the higher income brackets spend more than households in the lower income brackets over the course of the two years. Thus, the company could provide more incentives to households in the lower income brackets so that they increase their spending.
purchase_year <- as.numeric(merged_data$PURCHASE_DATE %>% str_sub(start = 1, end = 4))
merged_data <- merged_data %>% mutate(Year = purchase_year)
spending <- merged_data %>%
dplyr::select(HSHD_NUM, SPEND, INCOME_RANGE, UNITS, LOYALTY_FLAG,
AGE_RANGE,MARITAL_STATUS, HOMEOWNER_DESC,
HSHD_COMPOSITION, HH_SIZE, CHILDREN) %>%
group_by(HSHD_NUM) %>%
dplyr::summarise(TOT_SPEND = sum(SPEND), INCOME_RANGE = unique(INCOME_RANGE),
TOT_UNITS = sum(UNITS), LOYALTY_FLAG = unique(LOYALTY_FLAG),
AGE_RANGE = unique(AGE_RANGE), MARITAL_STATUS = unique(MARITAL_STATUS),
HOMEOWNER_DESC = unique(HOMEOWNER_DESC),
HSHD_COMPOSITION = unique(HSHD_COMPOSITION),
HH_SIZE = unique(HH_SIZE), CHILDREN = unique(CHILDREN))
a <- ggplot(spending, aes(y = TOT_SPEND, x = TOT_UNITS
, color = INCOME_RANGE)) +
geom_point(alpha = 0.5, position = "jitter") +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "Total Units",
y = "Total Spending of individual households",
title = "Income wise Household Spending - 2016 & 2017")
ggplotly(a)
Exploratory Data Analysis
The graph below shows the monthly spending of the households in 2016 and 2017. We can clearly see seasonality in the data. In certain months spending is higher in 2016 than in 2017, and in other months it is lower. The smooth lines indicate that spending in 2017 was strong in the initial months but tapered off in the later months compared to 2016.
purchase_yemo <- merged_data$PURCHASE_DATE %>% str_sub(start = 1, end = 7)
merged_data <- merged_data %>% mutate(YeMo = purchase_yemo)
spending_timeseries <- merged_data %>%
dplyr::select(SPEND, YeMo, Year) %>%
group_by(YeMo) %>%
dplyr::summarise(tot_spend = sum(SPEND), YE = unique(Year))
b <- ggplot(spending_timeseries, aes(y = tot_spend,
x = as.Date(as.yearmon(YeMo)),
color = as.factor(YE))) +
geom_point(alpha = 0.5, position = "jitter") +
geom_line() +
geom_smooth(method = "lm", se = FALSE) +
geom_smooth(method = "lm", se = FALSE, aes(group = 1, col = "All"), linetype = 2) +
labs(x = "Monthly Index",
y = "Monthwise Spending",
title = "Household Spending month by month- 2016 & 2017",
color = "Year")
ggplotly(b)
Exploratory Data Analysis
The graph below shows the monthly spending on organic products in 2016 and 2017. We can clearly see that spending on organic products increased in 2017. One reason might be that people are becoming more conscious about their health; another might be that companies have started promoting their organic products more than their regular products.
spending_timeseries_org <- merged_data %>%
dplyr::select(SPEND, YeMo, YEAR, ORGANIC_FLAG) %>%
group_by(YeMo, ORGANIC_FLAG) %>%
dplyr::summarise(tot_spend = sum(SPEND), YE = unique(YEAR))
c <- ggplot(spending_timeseries_org, aes(y = tot_spend,
x = as.Date(as.yearmon(YeMo)),
color = as.factor(YE))) +
geom_point(alpha = 0.5, position = "jitter") +
geom_line() +
geom_smooth(method = "lm", se = FALSE) +
facet_wrap(. ~ ORGANIC_FLAG, scales = "free_y") +
labs(x = "Monthly Index",
y = "Monthwise Spending",
title = "Monthly spending for orgainic products and non-organic products",
color = "Year")
ggplotly(c)
Exploratory Data Analysis
The graph below shows region-wise spending for the four regions. We can see that the lower spending brackets contain a lot of households and are dominated by the Southern region. The spending distributions are centred close to each other for all four regions, except that the Eastern region's centre sits slightly to the right of the others. We can also see a very thin band in the 2,500 to 3,000 range dominated by households from the Western region.
summar <- merged_data %>%
dplyr::select(HSHD_NUM, SPEND, Year, STORE_REGION) %>%
group_by(STORE_REGION, HSHD_NUM, Year) %>%
dplyr::summarise(tot_spend = sum(SPEND))
d <- ggplot(summar,aes(x = tot_spend, fill = STORE_REGION)) +
geom_density(col = NA, alpha = 0.35) +
labs(x = "Household Spending",
y = "Density",
title = "Region Wise Spending",
color = "Store Region")
ggplotly(d)
Exploratory Data Analysis
Finally, this graph shows the spending on different commodities in 2016 and 2017. We can clearly see that households spend the most on groceries.
summar2 <- merged_data %>%
dplyr::select(COMMODITY, SPEND, Year) %>%
group_by(COMMODITY, Year) %>%
dplyr::summarise(tot_spend = sum(SPEND))
summar2$Year <- as.factor(summar2$Year)
e <- ggplot(summar2, aes(x = tot_spend, y = COMMODITY, col = Year)) +
geom_point(position = "jitter", alpha = 0.4) +
labs(x = "Spending",
y = "Commodity Type",
title = "Yearly Spending commodity wise",
color = "Year") +
theme(axis.text.y = element_text(size = 8))
ggplotly(e)
Market Basket Analysis
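The rules below are mined from a basket-format file, data/transactions.csv, in which each line lists the commodities purchased together in one basket. How that file was produced is not shown here; a possible sketch, assuming it is derived from merged_data by collapsing each BASKET_NUM to its distinct commodities, is:
# Hypothetical preparation of the basket-format file used below:
# one line per basket, distinct commodities separated by commas.
basket_items <- merged_data %>%
  group_by(BASKET_NUM) %>%
  dplyr::summarise(items = paste(unique(COMMODITY), collapse = ","))
writeLines(basket_items$items, "data/transactions.csv")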
tr <- read.transactions("data/transactions.csv",format = "basket", sep = ",")
itemFrequencyPlot(tr, topN = 20, type = "relative", col = RColorBrewer::brewer.pal(8, 'Pastel2'),
                  main = "Relative Item Frequency Plot")
association_rules <- apriori(tr, parameter = list(supp = 0.15, conf = 0.8, maxlen = 10,
target = "rules"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.15 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 150607
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[43 item(s), 1004051 transaction(s)] done [0.61s].
## sorting and recoding items ... [9 item(s)] done [0.06s].
## creating transaction tree ... done [1.01s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [11 rule(s)] done [0.00s].
## creating S4 object ... done [0.10s].
summary(association_rules)
## set of 11 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 5 6
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 3.000 2.545 3.000 3.000
##
## summary of quality measures:
## support confidence lift count
## Min. :0.1563 Min. :0.8027 Min. :1.232 Min. :156960
## 1st Qu.:0.1771 1st Qu.:0.8355 1st Qu.:1.282 1st Qu.:177832
## Median :0.1949 Median :0.9183 Median :1.409 Median :195669
## Mean :0.2258 Mean :0.8802 Mean :1.351 Mean :226701
## 3rd Qu.:0.2531 3rd Qu.:0.9238 3rd Qu.:1.418 3rd Qu.:254106
## Max. :0.3428 Max. :0.9264 Max. :1.422 Max. :344195
##
## mining info:
## data ntransactions support confidence
## tr 1004051 0.15 0.8
inspect(association_rules[1:10])
## lhs rhs support confidence
## [1] {HOUSEHOLD} => {GROCERY STAPLE} 0.1948795 0.8027117
## [2] {FROZEN FOOD} => {GROCERY STAPLE} 0.2418503 0.8163423
## [3] {BAKERY} => {GROCERY STAPLE} 0.2643113 0.8387680
## [4] {DAIRY} => {GROCERY STAPLE} 0.3394828 0.8544413
## [5] {PRODUCE} => {GROCERY STAPLE} 0.3428063 0.8322996
## [6] {BEVERAGE - NON WATER,DAIRY} => {GROCERY STAPLE} 0.1563267 0.9263511
## [7] {DAIRY,FROZEN FOOD} => {GROCERY STAPLE} 0.1713967 0.9264407
## [8] {FROZEN FOOD,PRODUCE} => {GROCERY STAPLE} 0.1649468 0.9232635
## [9] {BAKERY,DAIRY} => {GROCERY STAPLE} 0.1869646 0.9242654
## [10] {BAKERY,PRODUCE} => {GROCERY STAPLE} 0.1828323 0.9182553
## lift count
## [1] 1.232053 195669
## [2] 1.252974 242830
## [3] 1.287394 265382
## [4] 1.311451 340858
## [5] 1.277466 344195
## [6] 1.421823 156960
## [7] 1.421960 172091
## [8] 1.417084 165615
## [9] 1.418621 187722
## [10] 1.409397 183573
subRules <- association_rules[quality(association_rules)$confidence > 0.8]
top10subRules <- head(association_rules, n = 10, by = "confidence")
plot(top10subRules, method = "graph", engine = "htmlwidget")
Recency, Frequency and Monetary Value grouping based on K-Means Clustering
purchase_year <- as.numeric(merged_data$PURCHASE_DATE %>% str_sub(start = 1, end = 4))
merged_data <- merged_data %>% mutate(Year = purchase_year)
recent_date <- max(merged_data$PURCHASE_DATE) + 1
df_RFM <- merged_data %>%
filter(Year == 2017) %>%
group_by(HSHD_NUM) %>%
dplyr::summarise(Recency = as.numeric(recent_date - max(PURCHASE_DATE)),
Frequency = length(unique(BASKET_NUM)),
MonetaryValue = sum(SPEND))
df_RFM[,-1] %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_density() +
theme_minimal()
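# Log-transform and standardise the RFM variables (the density plots above show right-skewed distributions)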
df_RFM$ScaledRecency <- scale(log(df_RFM$Recency))
df_RFM$ScaledFrequency <- scale(log(df_RFM$Frequency))
df_RFM$ScaledMonetaryValue <- scale(log(df_RFM$MonetaryValue))
# Elbow method: total within-cluster sum of squares for k = 1..15
wss <- vector("numeric", 15)
for (i in 1:15) {
  km_out <- kmeans(df_RFM[,5:7], centers = i, nstart = 20, iter.max = 50)
  wss[i] <- km_out$tot.withinss
}
plot(1:15, wss, type = "b", xlab = "Number of Clusters",
     ylab = "Total within-cluster sum of squares")
km_out1 <- kmeans(df_RFM[,5:7], centers = 3, nstart = 20, iter.max = 50)
clustered_df <- data.frame(HSHD_NUM = df_RFM$HSHD_NUM, Recency = df_RFM$Recency,
Frequency = df_RFM$Frequency,
MonetaryValue = df_RFM$MonetaryValue, Cluster = km_out1$cluster)
summary_table <- clustered_df %>%
dplyr::select(Cluster, Recency,Frequency,MonetaryValue) %>%
group_by(Cluster) %>%
dplyr::summarise(Recency = mean(Recency), Frequency = mean(Frequency),
Monetaryvalue = mean(MonetaryValue),
Number_Household = length(Cluster))
gathered_table <- gather(summary_table[1:4], RFM, Mean_value, -Cluster)
ggplot(gathered_table, aes(y = Mean_value, x = RFM, color = as.factor(Cluster))) +
geom_line(aes(group = Cluster)) +
scale_y_log10() +
labs(x = "RFM",
y = "Mean RFM Value",
title = "K-Means Clutering based on RFM")
Stepwise Selection of predictors impacting total household spending
The technique used here is stepwise selection, to identify the best predictors for a linear regression. The purpose of using linear regression is to find the relationship between the spending variable (response) and the different predictors. The predictors shortlisted by stepwise selection were income range, units, homeowner description and marital status.
linear_data <- merged_data %>%
dplyr::select(HSHD_NUM,AGE_RANGE,MARITAL_STATUS,
INCOME_RANGE,HSHD_COMPOSITION,HH_SIZE,
CHILDREN, SPEND, LOYALTY_FLAG, UNITS,
STORE_REGION, HOMEOWNER_DESC) %>%
group_by(HSHD_NUM) %>%
dplyr::summarise(SPEND = sum(SPEND),
INCOME_RANGE = unique(INCOME_RANGE),LOYALTY_FLAG = unique(LOYALTY_FLAG),
AGE_RANGE = unique(AGE_RANGE),
CHILDREN = unique(CHILDREN), HOMEOWNER_DESC = unique(HOMEOWNER_DESC),
MARITAL_STATUS = unique(MARITAL_STATUS),
HSHD_COMPOSITION = unique(HSHD_COMPOSITION), HH_SIZE = unique(HH_SIZE),
UNITS = sum(UNITS))
linear_data <- linear_data[,-1]   # drop the HSHD_NUM identifier
map_dbl(linear_data, function(x){sum(is.na(x))}) %>%
sort(decreasing = TRUE) %>% .[. > 0]
## CHILDREN HOMEOWNER_DESC MARITAL_STATUS AGE_RANGE
## 3075 1246 1207 997
## HSHD_COMPOSITION HH_SIZE INCOME_RANGE
## 934 933 865
linear_data <- linear_data[,-5]   # drop CHILDREN, the variable with the most missing values
entire_model <- lm(SPEND~., data = na.omit(linear_data))
null_model <- lm(SPEND~1, data = na.omit(linear_data))
stepwise_selection_full <- step(null_model, scope = list(lower = null_model
,upper = entire_model),
direction = "both")
## Start: AIC=59409.38
## SPEND ~ 1
##
## Df Sum of Sq RSS AIC
## + UNITS 1 1.3818e+11 1.4404e+10 51457
## + LOYALTY_FLAG 1 4.0783e+09 1.4850e+11 59320
## + HH_SIZE 4 1.5267e+09 1.5106e+11 59383
## + AGE_RANGE 6 1.1104e+09 1.5147e+11 59397
## + HSHD_COMPOSITION 5 6.0423e+08 1.5198e+11 59406
## <none> 1.5258e+11 59409
## + MARITAL_STATUS 1 2.3418e+07 1.5256e+11 59411
## + INCOME_RANGE 5 3.8384e+08 1.5220e+11 59411
## + HOMEOWNER_DESC 1 1.6016e+05 1.5258e+11 59411
##
## Step: AIC=51457.43
## SPEND ~ UNITS
##
## Df Sum of Sq RSS AIC
## + INCOME_RANGE 5 5.7999e+08 1.3824e+10 51329
## + HOMEOWNER_DESC 1 1.0488e+08 1.4299e+10 51435
## + HSHD_COMPOSITION 5 9.3793e+07 1.4310e+10 51445
## + MARITAL_STATUS 1 5.0398e+07 1.4353e+10 51448
## + HH_SIZE 4 4.3577e+07 1.4360e+10 51455
## + AGE_RANGE 6 5.2266e+07 1.4351e+10 51457
## + LOYALTY_FLAG 1 9.1878e+06 1.4394e+10 51457
## <none> 1.4404e+10 51457
## - UNITS 1 1.3818e+11 1.5258e+11 59409
##
## Step: AIC=51328.92
## SPEND ~ UNITS + INCOME_RANGE
##
## Df Sum of Sq RSS AIC
## + HOMEOWNER_DESC 1 6.1786e+07 1.3762e+10 51316
## + MARITAL_STATUS 1 2.3277e+07 1.3800e+10 51325
## + AGE_RANGE 6 5.3600e+07 1.3770e+10 51328
## + HSHD_COMPOSITION 5 4.1057e+07 1.3783e+10 51329
## <none> 1.3824e+10 51329
## + LOYALTY_FLAG 1 2.8267e+06 1.3821e+10 51330
## + HH_SIZE 4 1.4566e+07 1.3809e+10 51333
## - INCOME_RANGE 5 5.7999e+08 1.4404e+10 51457
## - UNITS 1 1.3838e+11 1.5220e+11 59411
##
## Step: AIC=51315.83
## SPEND ~ UNITS + INCOME_RANGE + HOMEOWNER_DESC
##
## Df Sum of Sq RSS AIC
## + MARITAL_STATUS 1 8.7127e+06 1.3753e+10 51316
## <none> 1.3762e+10 51316
## + LOYALTY_FLAG 1 2.6717e+06 1.3759e+10 51317
## + HSHD_COMPOSITION 5 3.4185e+07 1.3728e+10 51317
## + AGE_RANGE 6 3.6780e+07 1.3725e+10 51319
## + HH_SIZE 4 8.3707e+06 1.3754e+10 51322
## - HOMEOWNER_DESC 1 6.1786e+07 1.3824e+10 51329
## - INCOME_RANGE 5 5.3690e+08 1.4299e+10 51435
## - UNITS 1 1.3844e+11 1.5220e+11 59413
##
## Step: AIC=51315.69
## SPEND ~ UNITS + INCOME_RANGE + HOMEOWNER_DESC + MARITAL_STATUS
##
## Df Sum of Sq RSS AIC
## <none> 1.3753e+10 51316
## - MARITAL_STATUS 1 8.7127e+06 1.3762e+10 51316
## + LOYALTY_FLAG 1 2.3356e+06 1.3751e+10 51317
## + HSHD_COMPOSITION 4 2.5472e+07 1.3728e+10 51317
## + AGE_RANGE 6 3.2884e+07 1.3720e+10 51320
## + HH_SIZE 4 1.1109e+07 1.3742e+10 51321
## - HOMEOWNER_DESC 1 4.7222e+07 1.3800e+10 51325
## - INCOME_RANGE 5 5.2434e+08 1.4278e+10 51432
## - UNITS 1 1.3843e+11 1.5218e+11 59414
best_model <- lm(SPEND~ INCOME_RANGE + UNITS + HOMEOWNER_DESC + MARITAL_STATUS,
data = na.omit(linear_data))
summary(best_model)
##
## Call:
## lm(formula = SPEND ~ INCOME_RANGE + UNITS + HOMEOWNER_DESC +
## MARITAL_STATUS, data = na.omit(linear_data))
##
## Residuals:
## Min 1Q Median 3Q Max
## -18874.8 -854.3 -120.6 744.8 30992.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 322.43711 59.66965 5.404 6.98e-08 ***
## INCOME_RANGE.L 1030.94442 96.64390 10.667 < 2e-16 ***
## INCOME_RANGE.Q -22.41218 94.34536 -0.238 0.812241
## INCOME_RANGE.C -85.39749 89.22845 -0.957 0.338603
## INCOME_RANGE^4 76.83613 85.81353 0.895 0.370646
## INCOME_RANGE^5 -32.53545 84.93640 -0.383 0.701702
## UNITS 2.72862 0.01484 183.925 < 2e-16 ***
## HOMEOWNER_DESCRenter -370.50425 109.06625 -3.397 0.000689 ***
## MARITAL_STATUSSingle -112.60097 77.16717 -1.459 0.144608
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2023 on 3361 degrees of freedom
## Multiple R-squared: 0.9099, Adjusted R-squared: 0.9096
## F-statistic: 4241 on 8 and 3361 DF, p-value: < 2.2e-16
Cohort Analysis
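# Cohort analysis: households are grouped by the month of their first
# purchase (their "join" month), and we track what share of each cohort
# is still shopping in each subsequent month.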
cohort_data <- merged_data %>%
dplyr::select(HSHD_NUM,PURCHASE_DATE,Year)
cohort_data$PURCHASE_DATE <- ymd(format(cohort_data$PURCHASE_DATE, "%Y-%m-01"))
Join_Date <- cohort_data %>%
group_by(HSHD_NUM) %>%
dplyr::summarise(Joindate = min(PURCHASE_DATE))
cohort_data <- left_join(cohort_data,Join_Date, by = "HSHD_NUM")
years_diff <- as.numeric(format(cohort_data$PURCHASE_DATE,'%Y')) -
as.numeric(format(cohort_data$Joindate,'%Y'))
months_diff <- as.numeric(format(cohort_data$PURCHASE_DATE,'%m')) -
as.numeric(format(cohort_data$Joindate,'%m'))
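# Month offset from the join month (1 = the month of the household's first purchase)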
cohort_data$monthindex <- years_diff * 12 + months_diff + 1
cohort_data$JoinMonth <- format(cohort_data$Joindate,"%Y-%m")
cohort_final <- cohort_data %>% group_by(JoinMonth, monthindex) %>%
dplyr::summarise(hshd = length(unique(HSHD_NUM)))
cohort_spread <- spread(cohort_final,monthindex,hshd)
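# Convert raw household counts to retention percentages relative to each
# cohort's size in its join month (column 2)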
for (i in 3:25) {
cohort_spread[,i] <- round(((cohort_spread[,i] / cohort_spread[,2]) * 100), 1)
}
cohort_spread[,2] <- round(((cohort_spread[,2] / cohort_spread[,2]) * 100),1)
cohort_spread <- cohort_spread[2:23,]
breaks <- quantile(data.frame(cohort_spread[,2:24]), probs = seq(.05, .95, .05),
na.rm = TRUE)
colors <- sapply(round(seq(155, 80, length.out = length(breaks) + 1), 0),
function(x){ rgb(x,x,155, maxColorValue = 155) } )
# The Retention Mixpanel with counts
DT::datatable(cohort_spread,
class = 'cell-border stripe',
rownames = FALSE,
options = list(
ordering=F,
dom = 't',
pageLength = 23) ) %>%
formatStyle("JoinMonth",
backgroundColor = 'lightgrey',
fontWeight = 'bold') %>%
formatStyle(names(cohort_spread[-1]),fontWeight = 'bold',
color = 'white', backgroundColor = styleInterval(breaks,colors))